feat: Add a parquet uuid calculation #3440
base: main
Conversation
Codecov Report: All modified and coverable lines are covered by tests ✅
@NJManganelli - what is the status of this PR? Are you still working on it? Thanks!
Hi @ianna, I'll add a test; then I think it'll be ready from my side.
Not without a performance penalty, but if it needs to be optimized, we could figure out a smarter but still sufficient calculation (I'd like to ensure that any changes in compression, columns, or rows are captured). It's also possible that I didn't explore enough of the checksum information that's supposed to be available (but I think that was at the page level or something, and just the loop over all those pages seems like it would be much worse than this).

Without uuid:

python -m timeit -n 1000 "import test_3440_calculate_parquet_uuid; test_3440_calculate_parquet_uuid.test_parquet_uuid()"
1000 loops, best of 5: 83 usec per loop

With uuid:
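For reference, the shell benchmark quoted above can be reproduced from inside Python with the stdlib `timeit` module. The stub function below is a hypothetical placeholder; the real measurement imports the project's `test_3440_calculate_parquet_uuid` module instead.

```python
import timeit

# Hypothetical stand-in for the benchmarked call; the real invocation runs
# test_3440_calculate_parquet_uuid.test_parquet_uuid() instead.
def read_metadata_stub():
    return {"num_rows": 5, "num_row_groups": 1}

# Equivalent of `python -m timeit -n 1000 "..."`: 1000 calls per repeat,
# best of 5 repeats, reported as time per call.
best = min(timeit.repeat(read_metadata_stub, number=1000, repeat=5))
print(f"1000 loops, best of 5: {best / 1000 * 1e6:.2f} usec per loop")
```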
Compare 5000ce8 to eab037d: …e first and last row_groups plus the col_counts of all row_groups of the file or dataset
Rebased for my own sanity, and marked ready (presuming all the tests are going to pass; I'll fix it otherwise).
@NJManganelli - it looks like the uuids do not match:

______________________________ test_parquet_uuid _______________________________

    def test_parquet_uuid():
        meta = metadata_from_parquet(input)
>       assert (
            meta["uuid"]
            == "93fd596534251b1ac750f359d9d55418b02bcddcc79cd5ab1e3d9735941fbcae"
        )
E       AssertionError: assert 'adc236484e10384ad680a1ae2ce1fc675f6f9f0a84a02c651ed91ad97fba41c0' == '93fd596534251b1ac750f359d9d55418b02bcddcc79cd5ab1e3d9735941fbcae'
E
E       - 93fd596534251b1ac750f359d9d55418b02bcddcc79cd5ab1e3d9735941fbcae
E       + adc236484e10384ad680a1ae2ce1fc675f6f9f0a84a02c651ed91ad97fba41c0

meta = {'col_counts': [5],
 'columns': ['u1', 'u4', 'u8', 'f4', 'f8', 'raw', 'utf8'],
 'form': RecordForm([BitMaskedForm('u8', NumpyForm('bool'), True, True), BitMaskedForm('u8', NumpyForm('int32'), True, True), BitMaskedForm('u8', NumpyForm('int64'), True, True), BitMaskedForm('u8', NumpyForm('float32'), True, True), BitMaskedForm('u8', NumpyForm('float64'), True, True), BitMaskedForm('u8', ListOffsetForm('i32', NumpyForm('uint8', parameters={'__array__': 'byte'}), parameters={'__array__': 'bytestring'}), True, True), BitMaskedForm('u8', ListOffsetForm('i32', NumpyForm('uint8', parameters={'__array__': 'char'}), parameters={'__array__': 'string'}), True, True)], ['u1', 'u4', 'u8', 'f4', 'f8', 'raw', 'utf8']),
 'fs': <fsspec.implementations.local.LocalFileSystem object at 0x7ff63efc9340>,
 'num_row_groups': 1,
 'num_rows': 5,
 'paths': ['/home/runner/work/awkward/awkward/tests/samples/nullable-record-primitives.parquet'],
 'uuid': 'adc236484e10384ad680a1ae2ce1fc675f6f9f0a84a02c651ed91ad97fba41c0'}

tests/test_3440_calculate_parquet_uuid.py:22: AssertionError
Aye, judging from that printout, this will need to be more selective about what goes into the hash. I'll have a look when I am back from holidays.
…ulnerable to OS-specific effects
… for OS-agnostic uuid
@ianna I'm trying a more selective set of key-value pairs, hoping it'll be more stable, but "it works on my machine" just as the previous one did, so we'll need to see what the tests say, I think.
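A minimal sketch of that idea, with key names taken from the failing test's printout (the helper itself is hypothetical, not the PR's code): hash only machine-independent fields, deliberately skipping runner-specific ones like `paths` (absolute paths) and `fs` (a live filesystem object), which would make the digest differ across machines.

```python
import hashlib
import json

def stable_uuid(meta: dict) -> str:
    # Hypothetical sketch: hash only machine-independent metadata.
    # 'fs' and 'paths' are excluded on purpose, since they vary per runner.
    stable_keys = ("num_rows", "num_row_groups", "columns", "col_counts")
    payload = json.dumps({k: meta[k] for k in stable_keys}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

meta = {
    "num_rows": 5,
    "num_row_groups": 1,
    "columns": ["u1", "u4", "u8", "f4", "f8", "raw", "utf8"],
    "col_counts": [5],
    "paths": ["/home/runner/..."],  # deliberately ignored by the hash
}
print(stable_uuid(meta))  # 64 hex chars, identical on any machine
```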
Calculate a uuid from parquet metadata, utilizing detailed info from the first and last row_groups plus the col_counts of all row_groups of the file or dataset. At the column-page level, parquet should have a checksum AFAIK, but an approximate calculation that deterministically uses two row groups and catches differences in the number of rows, row groups, columns, compression, etc. should be sufficient for the equivalent of what coffea does with root files (which is to flag them when they change, so that the form, steps, etc. are recalculated).
https://github.com/scikit-hep/coffea/blob/master/src/coffea/dataset_tools/preprocess.py#L46-L48
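A stdlib-only sketch of the calculation described above (the row-group dicts and their keys are illustrative assumptions, not awkward's actual metadata layout): fold detailed info from the first and last row groups, plus every row group's column count, into one SHA-256 digest.

```python
import hashlib
import json

def dataset_uuid(row_groups):
    # Hypothetical sketch of the approach described above: hash detailed
    # info of the first and last row groups plus col_counts of all of them.
    h = hashlib.sha256()
    # Detailed info from only the boundary row groups keeps the cost O(1)
    # in the number of row groups.
    for rg in (row_groups[0], row_groups[-1]):
        h.update(json.dumps(rg, sort_keys=True).encode("utf-8"))
    # Column counts from every row group catch edits in the middle.
    col_counts = [rg["num_columns"] for rg in row_groups]
    h.update(json.dumps(col_counts).encode("utf-8"))
    return h.hexdigest()

row_groups = [
    {"num_rows": 5, "num_columns": 7, "total_byte_size": 1234, "compression": "SNAPPY"},
    {"num_rows": 3, "num_columns": 7, "total_byte_size": 987, "compression": "SNAPPY"},
]
print(dataset_uuid(row_groups))
```

Changing any hashed field (e.g. the compression codec of a boundary row group) produces a different digest, which is enough to flag the dataset for reprocessing.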
Also, the ParquetMetadata namedtuple doesn't appear to be used, at least in the file this PR touches. Given that an extra line is needed to avoid changing the length of the returned tuple (so as not to break compatibility for outside users), maybe that return style should be deprecated and the namedtuple used instead?